Get prepared
install.packages(c("tableone", "DALEX", "ggplot2", "partykit", "mlr3", "mlr3learners", "ranger", "mlr3tuning", "paradox"))
The purpose of this tutorial is to present a life cycle of a single predictive model for a problem related to binary classification (but I deliberately don’t mention logistic regression). Along the way, we will tackle various interesting topics such as model training, model verification, hyperparameter tuning, and exploratory model analysis.
The examples presented here are inspired by three textbooks: The Elements of Statistical Learning, which covers the mathematical foundations; The mlr3 book, which presents the mlr3 software package designed for advanced model programming; and Explanatory Model Analysis, which overviews methods for model exploration and visualisation. Note that responsible modelling requires knowledge that cannot be learned in two days. So, after this introduction, I highly recommend checking out these books.
Why should I care?
Predictive models have been used throughout human history. Priests in Egypt predicted when the Nile would flood or when a solar eclipse would come. Developments in statistics, the increasing availability of datasets, and growing computing power allow predictive models to be built faster and faster.
Today, predictive models are used virtually everywhere. Planning the supply chain for a large corporation, recommending lunch or a movie for the evening, or predicting traffic jams in a city. Newspapers are full of interesting applications.
But how are such predictive models developed? In the following sections, we will go through the life cycle of a predictive model: from the concept phase, through design, training, and checking, to deployment. For this example, we will use a dataset on the risk of death for patients after SARS-COV-2 infection. But keep in mind that the data presented here is artificial. It was generated to mirror relations in real data, but it does not contain observations for real patients. Still, it should be an interesting use case for discussing the typical life cycle of a predictive model.
Tools
These materials are based on two R packages: mlr3 for model training and DALEX for model visualization and explanation. There are more packages with similar functionality: for modelling, other popular choices are mlr, tidymodels, and caret, while for model explanation you will find lots of interesting features in flashlight and iml.
The problem
The life cycle of a predictive model begins with a well-defined problem. In this example, we are looking for a model that assesses the risk of death after diagnosed covid. We don’t want to guess who will survive and who won’t. We want to construct a score that allows us to sort patients by risk of death.
Why do we need such a model? It could have many applications! Those at higher risk of death could be given more protection, such as providing them with pulse oximeters or preferentially vaccinating them.
library("tableone")
library("DALEX")
library("ggplot2")
library("partykit")
library("mlr3")
library("mlr3learners")
library("ranger")
library("mlr3tuning")
library("paradox")
set.seed(1313)
Before we build any model, even before we touch any data, we should first determine the purpose for which we will build the predictive model.
It is very important to define the objective before we sit down to program, because later it is easy to get lost in setting function parameters and dealing with all the technical details, and so to lose sight of the long-term goal.
So, first: Define the objective.
For the purpose of these exercises, I have selected data on the covid pandemic. Imagine that we want to determine the order of vaccination. In this example we want to create a predictive model that assesses individual risks because we would like to rank patients according to their risk.
To get a model that gives the best ranking, we will use the AUC measure to evaluate model performance. I will explain exactly what the AUC is a little later; for now, the key thing is that we are interested in ranking patients based on their risk scores.
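The AUC has a handy probabilistic reading: it is the probability that a randomly chosen positive case gets a higher score than a randomly chosen negative one. A minimal base-R sketch computes it from ranks (the function name auc_score is ours, not from any package):

```r
# AUC = P(score of random positive > score of random negative),
# computed from ranks (ties get average rank). Base R only;
# auc_score is our own name, not a package function.
auc_score <- function(score, label) {   # label: logical, TRUE = positive
  r <- rank(score)
  n_pos <- sum(label)
  n_neg <- sum(!label)
  (sum(r[label]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_score(c(0.1, 0.2, 0.8, 0.9), c(FALSE, FALSE, TRUE, TRUE))  # perfect ranking: 1
```

Note that the AUC depends only on the ordering of the scores, which is exactly why it fits a ranking problem.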
To build a model we need good data. In Machine Learning, the word good means a large amount of representative data. Collecting representative data is not easy and often requires designing an appropriate experiment.
The best possible scenario is that one can design and run an experiment to collect the necessary data. In less comfortable situations, we look for “natural experiments,” i.e., data that have been collected for another purpose but can be used to build a model. Here we will use data collected through epidemiological interviews. There will be a lot of data points and it should be fairly representative, although unfortunately it only involves symptomatic patients who tested positive for SARS-COV-2.
For the purposes of this exercise, I have prepared two sets of characteristics of patients infected with covid. It is important to note that these are not real patient data. This is simulated data, generated to have relationships consistent with real data (obtained from NIH), but the data itself is not real. Fortunately, they are sufficient for the purposes of our exercise.
The data is divided into two sets, covid_spring and covid_summer. The first was acquired in spring 2020 and will be used as training data, while the second was acquired in summer and will be used for validation. In machine learning, model validation is performed on a separate data set. This controls the risk of overfitting a flexible model to the data. If we do not have a separate set, then one is generated using cross-validation, out-of-sample, or out-of-time techniques.
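As a sketch of how cross-validation splits the data (in practice, mlr3's rsmp("cv", folds = 5), used later in this tutorial, does this for us; the variable names here are ours):

```r
# A minimal sketch of 5-fold cross-validation indices in base R.
set.seed(1313)
k <- 5
n <- 100                                  # pretend we have 100 rows
fold <- sample(rep(1:k, length.out = n))  # shuffled fold assignment
# rows held out for validation in fold 1; the rest would be training
val_idx <- which(fold == 1)
length(val_idx)                           # 20 rows per fold
```

Each observation is used for validation exactly once, and for training in the remaining k - 1 folds.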
covid_spring.csv corresponds to covid mortality data from spring 2020. We will use this data for model training. covid_summer.csv corresponds to covid mortality data from summer 2020. We will use this data for model validation.
covid_spring <- read.table("data/covid_spring.csv", sep = ";", header = TRUE, stringsAsFactors = TRUE)
covid_summer <- read.table("data/covid_summer.csv", sep = ";", header = TRUE, stringsAsFactors = TRUE)
Before we start any serious modeling, it is worth looking at the data first. To do this, we will do a simple EDA. In R there are many tools for data exploration; I value packages that support the so-called table one.
library("tableone")
table1 <- CreateTableOne(vars = colnames(covid_spring)[1:11],
data = covid_spring,
strata = "Death")
print(table1)
## Stratified by Death
## No Yes p test
## n 9487 513
## Gender = Male (%) 4554 (48.0) 271 (52.8) 0.037
## Age (mean (SD)) 44.19 (18.32) 74.44 (13.27) <0.001
## Cardiovascular.Diseases = Yes (%) 839 ( 8.8) 273 (53.2) <0.001
## Diabetes = Yes (%) 260 ( 2.7) 78 (15.2) <0.001
## Neurological.Diseases = Yes (%) 127 ( 1.3) 57 (11.1) <0.001
## Kidney.Diseases = Yes (%) 111 ( 1.2) 62 (12.1) <0.001
## Cancer = Yes (%) 158 ( 1.7) 68 (13.3) <0.001
## Hospitalization = Yes (%) 2344 (24.7) 481 (93.8) <0.001
## Fever = Yes (%) 3314 (34.9) 335 (65.3) <0.001
## Cough = Yes (%) 3062 (32.3) 253 (49.3) <0.001
## Weakness = Yes (%) 2282 (24.1) 196 (38.2) <0.001
During modeling, exploration often takes the most time. In this case, we will limit ourselves to some simple graphs.
ggplot(covid_spring, aes(Age)) +
geom_histogram() +
ggtitle("Histogram of age")
ggplot(covid_spring, aes(Age, fill = Death)) +
geom_histogram() +
ggtitle("Histogram of age")
ggplot(covid_spring, aes(Age, fill = Death)) +
geom_histogram(color = "white") +
ggtitle("Histogram of age") +
DALEX::theme_ema() +
scale_fill_manual("", values = c("grey", "red3"))
library(ggmosaic)
ggplot(data = covid_spring) +
geom_mosaic(aes(x=product(Diabetes), fill = Death)) +
DALEX::theme_ema() +
scale_fill_manual("", values = c("grey", "red3"))
library("pheatmap")
pheatmap((covid_spring[,3:11] == "Yes") + 0)
One of the most important rules to remember when building a predictive model is: Do not condition on the future!
Variables like Hospitalization or Cough are not good predictors, because they are not known in advance.
covid_spring <- covid_spring[,c("Gender", "Age", "Cardiovascular.Diseases", "Diabetes",
"Neurological.Diseases", "Kidney.Diseases", "Cancer",
"Death")]
covid_summer <- covid_summer[,c("Gender", "Age", "Cardiovascular.Diseases", "Diabetes",
"Neurological.Diseases", "Kidney.Diseases", "Cancer",
"Death")]
Exercises: Calculate tableone for covid_spring and covid_summer. In the DALEX package you will find the titanic_imputed dataset. Calculate tableone for this dataset.
We will think of a predictive model as a function that computes a certain prediction for certain input data. Usually, such a function is built automatically based on the data. But technically the model can be any function defined in any way. The first model will be based on statistics collected by the CDC (Centers for Disease Control and Prevention; you will find a set of useful statistics related to Covid mortality on their website) that determine mortality in different age groups.
In many cases, you do not need data to create a model. Just google some information about the problem.
It turns out that CDC has some decent statistics about age-related mortality. These statistics will suffice as a first approximation of our model.
Lesson 1: Often you don’t need individual data to build a good model.
What is a predictive model? We will think of it as a function that takes a set of numbers as input and returns a single number as the result - the score.
cdc_risk <- function(x, base_risk = 0.00003) {
multip <- rep(7900, nrow(x))
multip[which(x$Age < 84.5)] <- 2800
multip[which(x$Age < 74.5)] <- 1100
multip[which(x$Age < 64.5)] <- 400
multip[which(x$Age < 49.5)] <- 130
multip[which(x$Age < 39.5)] <- 45
multip[which(x$Age < 29.5)] <- 15
multip[which(x$Age < 17.5)] <- 1
multip[which(x$Age < 4.5)] <- 2
multip * base_risk
}
x <- data.frame(Age = 25, Hospitalisation = "Yes")
cdc_risk(x)
## [1] 0.00045
library("DALEX")
model_cdc <- DALEX::explain(cdc_risk,
predict_function = function(m, x) m(x),
type = "classification",
label = "CDC")
## Preparation of a new explainer is initiated
## -> model label : CDC
## -> no data available! ( WARNING )
## -> target variable : not specified! ( WARNING )
## -> predict function : function(m, x) m(x)
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package Model of class: function package unrecognized , ver. Unknown , task regression ( default )
## -> model_info : type set to classification
## -> model_info : Model info detected classification task but 'y' is a NULL . ( WARNING )
## -> model_info : By deafult classification tasks supports only numercical 'y' parameter.
## -> model_info : Consider changing to numerical vector with 0 and 1 values.
## -> model_info : Otherwise I will not be able to calculate residuals or loss function.
## -> residual function : difference between y and yhat ( default )
## A new explainer has been created!
predict(model_cdc, x)
## [1] 0.00045
The same function can be written in a slightly more compact form (it now also works on many rows):
cdc_risk <- function(x, base_risk = 0.00003) {
bin <- cut(x$Age, c(-Inf, 4.5, 17.5, 29.5, 39.5, 49.5, 64.5, 74.5, 84.5, Inf))
relative_risk <- c(2, 1, 15, 45, 130, 400, 1100, 2800, 7900)[as.numeric(bin)]
relative_risk * base_risk
}
# check it
x <- data.frame(Age = c(25,45,85))
cdc_risk(x)
## [1] 0.00045 0.00390 0.23700
summary(cdc_risk(covid_spring))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00003 0.00135 0.00390 0.01849 0.01200 0.23700
table(Death = covid_spring$Death,
Prediction.above.005 = cdc_risk(covid_spring) > 0.05)
## Prediction.above.005
## Death FALSE TRUE
## No 8946 541
## Yes 237 276
In R, we have many tools for creating models. The problem is that these tools are created by different people and return results in different structures. So, in order to work with models uniformly, we need to wrap each model so that it exposes a uniform interface.
Different models have different APIs.
But you need One API to Rule Them All!
The DALEX library provides a unified architecture to explore and validate models using different analytical methods.
library("DALEX")
model_cdc <- DALEX::explain(cdc_risk,
predict_function = function(m, x) m(x),
data = covid_summer,
y = covid_summer$Death == "Yes",
type = "classification",
label = "CDC")
## Preparation of a new explainer is initiated
## -> model label : CDC
## -> data : 10000 rows 8 cols
## -> target variable : 10000 values
## -> predict function : function(m, x) m(x)
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package Model of class: function package unrecognized , ver. Unknown , task regression ( default )
## -> model_info : type set to classification
## -> model_info : Model info detected classification task but 'y' is a logical . Converted to numeric. ( NOTE )
## -> predicted values : numerical, min = 3e-05 , mean = 0.01480215 , max = 0.237
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.237 , mean = 0.008497847 , max = 0.99955
## A new explainer has been created!
predict(model_cdc, x)
## [1] 0.00045 0.00390 0.23700
Model performance for classification is evaluated with different measures than for regression.
For regression, commonly used measures are the mean squared error (MSE)
\[MSE(f) = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 \]
and the root mean squared error (RMSE)
\[RMSE(f) = \sqrt{MSE(f)} \]
For classification, commonly used measures are Accuracy
\[ACC(f) = (TP + TN)/n\]
Precision
\[Prec(f) = TP/(TP + FP)\]
and Recall
\[Recall(f) = TP/(TP + FN)\]
and F1 score
\[F1(f) = 2\frac{Prec(f) * Recall(f) }{Prec(f) + Recall(f)}\]
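These formulas translate directly into a few lines of base R. A minimal sketch (the function cls_measures and its argument names are our own, not a package API):

```r
# Classification measures from a confusion matrix, in base R.
# cls_measures and its argument names are ours, for illustration only.
cls_measures <- function(pred, truth) {
  tp <- sum(pred & truth);  fp <- sum(pred & !truth)
  fn <- sum(!pred & truth); tn <- sum(!pred & !truth)
  prec <- tp / (tp + fp)
  rec  <- tp / (tp + fn)
  c(accuracy  = (tp + tn) / length(truth),
    precision = prec,
    recall    = rec,
    f1        = 2 * prec * rec / (prec + rec))
}

cls_measures(pred  = c(TRUE, TRUE, FALSE, FALSE, TRUE),
             truth = c(TRUE, FALSE, FALSE, TRUE, TRUE))
# accuracy 0.6, precision 2/3, recall 2/3, f1 2/3
```

Note that all four measures depend on the cutoff used to turn scores into TRUE/FALSE predictions.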
In this problem we are interested in ranking of scores, so we will use the AUC measure (the area under the ROC curve).
There are many measures for evaluating predictive models and they are
located in various R packages (ROCR, measures,
mlr3measures, etc.). For simplicity, in this example, we
use only the AUC measure from the DALEX package.
Pregnancy: Sensitivity and Specificity
http://getthediagnosis.org/diagnosis/Pregnancy.htm
https://en.wikipedia.org/wiki/Sensitivity_and_specificity
For AUC the cutoff does not matter. But we set it to get
nice precision and F1.
Model performance
Model exploration starts with an assessment of how good the model is. The DALEX::model_performance function calculates a set of the most common measures for the specified model.
library("DALEX")
model_cdc <- DALEX::explain(cdc_risk,
predict_function = function(m, x) m(x),
type = "classification",
label = "CDC")
## Preparation of a new explainer is initiated
## -> model label : CDC
## -> no data available! ( WARNING )
## -> target variable : not specified! ( WARNING )
## -> predict function : function(m, x) m(x)
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package Model of class: function package unrecognized , ver. Unknown , task regression ( default )
## -> model_info : type set to classification
## -> model_info : Model info detected classification task but 'y' is a NULL . ( WARNING )
## -> model_info : By deafult classification tasks supports only numercical 'y' parameter.
## -> model_info : Consider changing to numerical vector with 0 and 1 values.
## -> model_info : Otherwise I will not be able to calculate residuals or loss function.
## -> residual function : difference between y and yhat ( default )
## A new explainer has been created!
predict(model_cdc, x)
## [1] 0.00045 0.00390 0.23700
model_cdc <- update_data(model_cdc,
data = covid_summer[,-8],
y = covid_summer$Death == "Yes")
## -> data : 10000 rows 7 cols
## -> target variable : 10000 values
## An explainer has been updated!
predict(model_cdc, x)
## [1] 0.00045 0.00390 0.23700
#library(ROCR)
mp_cdc <- model_performance(model_cdc, cutoff = 0.1)
mp_cdc
## Measures for: classification
## recall : 0.2188841
## precision : 0.2602041
## f1 : 0.2377622
## accuracy : 0.9673
## auc : 0.9042275
##
## Residuals:
## 0% 10% 20% 30% 40% 50% 60% 70%
## -0.23700 -0.03300 -0.01200 -0.01200 -0.00390 -0.00390 -0.00135 -0.00135
## 80% 90% 100%
## -0.00045 -0.00006 0.99955
Note: The model is evaluated on the data given in the explainer. Use
DALEX::update_data() to specify another dataset.
Note: Explainer knows whether the model is for classification or regression, so it automatically selects the right measures. It can be overridden if needed.
The S3 generic plot function draws a graphical summary
of the model performance. With the geom argument, one can
determine the type of chart.
plot(mp_cdc, geom = "roc") + DALEX::theme_ema()
Exercises: Check the performance of the CDC model on the covid_spring and covid_summer data. In the DALEX package you will find the titanic_imputed dataset. Build a model similar to the CDC one for the Titanic dataset. How good is your model?
Usually, we don't know which function is best for our problem. This is why we want to use data to find/train such a function with the help of some automated algorithm.
In machine learning, there are hundreds of algorithms to choose from. Usually, training boils down to finding parameters for some family of models. One of the most popular families is decision trees. Their great advantage is the transparency of their structure.
We will begin building the model by constructing a decision tree, controlling the complexity of the model step by step.
library("partykit")
tree1 <- ctree(Death ~., covid_spring,
control = ctree_control(maxdepth = 1))
plot(tree1)
tree2 <- ctree(Death ~., covid_spring,
control = ctree_control(maxdepth = 2))
plot(tree2)
tree3 <- ctree(Death ~., covid_spring,
control = ctree_control(maxdepth = 3))
plot(tree3)
tree <- ctree(Death ~., covid_spring,
control = ctree_control(alpha = 0.0001))
plot(tree)
To work with different models uniformly, we will also wrap this one into an explainer.
model_tree <- DALEX::explain(tree,
predict_function = function(m, x) predict(m, x, type = "prob")[,2],
data = covid_summer[,-8],
y = covid_summer$Death == "Yes",
type = "classification",
label = "Tree",
verbose = FALSE)
mp_tree <- model_performance(model_tree, cutoff = 0.1)
mp_tree
## Measures for: classification
## recall : 0.8626609
## precision : 0.1492205
## f1 : 0.2544304
## accuracy : 0.8822
## auc : 0.9126477
##
## Residuals:
## 0% 10% 20% 30% 40% 50%
## -0.462686567 -0.199029126 -0.007781621 -0.007781621 -0.007781621 -0.007781621
## 60% 70% 80% 90% 100%
## -0.007781621 -0.007781621 -0.007781621 -0.007781621 0.992218379
plot(mp_tree, geom = "roc")
hist(predict(model_tree, covid_summer))
plot(mp_tree, mp_cdc, geom = "roc")
Exercises: Check the performance of the tree model on the covid_spring and covid_summer data.
Example for titanic:
head(titanic_imputed)
tree <- ctree(survived ~., titanic_imputed)
plot(tree)
tree <- ctree(factor(survived) ~., titanic_imputed,
control = ctree_control(alpha = 0.0001))
plot(tree)
model_tree <- DALEX::explain(tree,
predict_function = function(m, x) predict(m, x, type = "prob")[,2],
data = titanic_imputed,
y = titanic_imputed$survived == 1,
type = "classification",
label = "Tree",
verbose = FALSE)
mp <- model_performance(model_tree)
mp
plot(mp, geom = "roc")
Decision trees are models that have low bias but high variance. In 2001, Leo Breiman proposed a new family of models, called random forests, which average scores from multiple decision trees trained on bootstrap samples of the data. The whole algorithm is a bit more complex but also very fascinating. You can read about it at https://tinyurl.com/RF2001. Averaging reduces variance at the expense of a small increase in bias. Nowadays a very popular, in a sense complementary, technique for improving models is boosting, which reduces bias at the expense of variance. Quite often it leads to a better model.
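A toy illustration of why averaging helps: if many roughly independent, noisy predictions are averaged, the noise largely cancels. Here the "models" are just simulated noisy estimates, no real trees are grown, and all names are ours:

```r
# Toy illustration of bagging: averaging B noisy "model" predictions
# shrinks the error of the ensemble well below the spread of any
# single model (for B independent models, variance drops ~B-fold).
set.seed(1313)
B <- 100
truth <- 0.3                                  # quantity being estimated
single_models <- truth + rnorm(B, sd = 0.1)   # B noisy model outputs
bagged <- mean(single_models)                 # the "forest" prediction
c(single_sd = sd(single_models),
  bagged_error = abs(bagged - truth))
```

In a real random forest the trees are correlated (they share data), so the gain is smaller than for independent models, which is why ranger decorrelates trees by sampling mtry variables at each split.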
We will train a random forest with the mlr3 library. The first step is to define the prediction task.
library("mlr3")
covid_task <- TaskClassif$new(id = "covid_spring",
backend = covid_spring,
target = "Death",
positive = "Yes")
covid_task
## <TaskClassif:covid_spring> (10000 x 8)
## * Target: Death
## * Properties: twoclass
## * Features (7):
## - fct (6): Cancer, Cardiovascular.Diseases, Diabetes, Gender,
## Kidney.Diseases, Neurological.Diseases
## - int (1): Age
Now we need to define the family of models in which we want to look for a solution. A random forest is specified by the "classif.ranger" learner. To find the best model in this family, we use the train() method.
library("mlr3learners")
library("ranger")
covid_ranger <- lrn("classif.ranger", predict_type = "prob",
num.trees = 25)
covid_ranger
## <LearnerClassifRanger:classif.ranger>
## * Model: -
## * Parameters: num.threads=1, num.trees=25
## * Packages: mlr3, mlr3learners, ranger
## * Predict Types: response, [prob]
## * Feature Types: logical, integer, numeric, character, factor, ordered
## * Properties: hotstart_backward, importance, multiclass, oob_error,
## twoclass, weights
covid_ranger$train(covid_task)
covid_ranger$model
## Ranger result
##
## Call:
## ranger::ranger(dependent.variable.name = task$target_names, data = task$data(), probability = self$predict_type == "prob", case.weights = task$weights$weight, num.threads = 1L, num.trees = 25L)
##
## Type: Probability estimation
## Number of trees: 25
## Sample size: 10000
## Number of independent variables: 7
## Mtry: 2
## Target node size: 10
## Variable importance mode: none
## Splitrule: gini
## OOB prediction error (Brier s.): 0.03760708
predict(covid_ranger, covid_summer[1:3,], predict_type = "prob")[,1]
## [1] 0.007847323 0.011142358 0.195194454
A trained model can be turned into an explainer. Simpler functions can be used to calculate the performance of this model. But using explainers has an advantage that will be seen in all its beauty in just two pages.
model_ranger <- explain(covid_ranger,
predict_function = function(m,x)
predict(m, x, predict_type = "prob")[,1],
data = covid_summer[,-8],
y = covid_summer$Death == "Yes",
type = "classification",
label = "Ranger",
verbose = FALSE)
mp_ranger <- model_performance(model_ranger)
mp_ranger
## Measures for: classification
## recall : 0.111588
## precision : 0.4482759
## f1 : 0.1786942
## accuracy : 0.9761
## auc : 0.942663
##
## Residuals:
## 0% 10% 20% 30% 40% 50%
## -0.633536505 -0.129990764 -0.018303930 -0.011174022 -0.011142358 -0.011065548
## 60% 70% 80% 90% 100%
## -0.009678616 -0.007173015 -0.007121335 -0.007089671 0.992878665
plot(mp_ranger, geom = "roc")
plot(mp_ranger, mp_tree, mp_cdc, geom = "roc")
Exercises: Check the performance of the ranger model on the covid_spring and covid_summer data. In the DALEX package you will find the titanic_imputed dataset. Build a tree-based model for the Titanic dataset. How good is your model?
Hyperparameter Optimisation
Machine Learning algorithms typically have many hyperparameters that determine how the model is to be trained. For models with high variance, the selection of such hyperparameters has a strong impact on the quality of the final solution. The mlr3tuning package contains procedures to automate the process of finding good hyperparameters.
See: https://mlr3book.mlr-org.com/tuning.html.
To use it, you must specify the hyperparameter space to search. Not all hyperparameters are worth optimizing. In the example below, we focus on five hyperparameters of the random forest algorithm.
For automatic hyperparameter search, it is necessary to specify a few more elements: (1) a stopping criterion, here a budget of 10 evaluations, (2) a search strategy for the parameter space, here random search, and (3) a way to evaluate the performance of the proposed models, here the AUC determined by 5-fold cross-validation.
In order to search for optimal parameters automatically, we first need to specify the space of possible hyperparameters.
library("mlr3tuning")
library("paradox")
covid_ranger$param_set
## <ParamSet>
## id class lower upper nlevels default
## 1: alpha ParamDbl -Inf Inf Inf 0.5
## 2: always.split.variables ParamUty NA NA Inf <NoDefault[3]>
## 3: class.weights ParamUty NA NA Inf
## 4: holdout ParamLgl NA NA 2 FALSE
## 5: importance ParamFct NA NA 4 <NoDefault[3]>
## 6: keep.inbag ParamLgl NA NA 2 FALSE
## 7: max.depth ParamInt 0 Inf Inf
## 8: min.node.size ParamInt 1 Inf Inf
## 9: min.prop ParamDbl -Inf Inf Inf 0.1
## 10: minprop ParamDbl -Inf Inf Inf 0.1
## 11: mtry ParamInt 1 Inf Inf <NoDefault[3]>
## 12: mtry.ratio ParamDbl 0 1 Inf <NoDefault[3]>
## 13: num.random.splits ParamInt 1 Inf Inf 1
## 14: num.threads ParamInt 1 Inf Inf 1
## 15: num.trees ParamInt 1 Inf Inf 500
## 16: oob.error ParamLgl NA NA 2 TRUE
## 17: regularization.factor ParamUty NA NA Inf 1
## 18: regularization.usedepth ParamLgl NA NA 2 FALSE
## 19: replace ParamLgl NA NA 2 TRUE
## 20: respect.unordered.factors ParamFct NA NA 3 ignore
## 21: sample.fraction ParamDbl 0 1 Inf <NoDefault[3]>
## 22: save.memory ParamLgl NA NA 2 FALSE
## 23: scale.permutation.importance ParamLgl NA NA 2 FALSE
## 24: se.method ParamFct NA NA 2 infjack
## 25: seed ParamInt -Inf Inf Inf
## 26: split.select.weights ParamUty NA NA Inf
## 27: splitrule ParamFct NA NA 3 gini
## 28: verbose ParamLgl NA NA 2 TRUE
## 29: write.forest ParamLgl NA NA 2 TRUE
## id class lower upper nlevels default
## parents value
## 1:
## 2:
## 3:
## 4:
## 5:
## 6:
## 7:
## 8:
## 9:
## 10:
## 11:
## 12:
## 13: splitrule
## 14: 1
## 15: 25
## 16:
## 17:
## 18:
## 19:
## 20:
## 21:
## 22:
## 23: importance
## 24:
## 25:
## 26:
## 27:
## 28:
## 29:
## parents value
search_space = ps(
num.trees = p_int(lower = 50, upper = 500),
max.depth = p_int(lower = 1, upper = 10),
mtry = p_int(lower = 1, upper = 7),
minprop = p_dbl(lower = 0.01, upper = 0.1),
splitrule = p_fct(levels = c("gini", "extratrees"))
)
search_space
## <ParamSet>
## id class lower upper nlevels default value
## 1: num.trees ParamInt 50.00 500.0 451 <NoDefault[3]>
## 2: max.depth ParamInt 1.00 10.0 10 <NoDefault[3]>
## 3: mtry ParamInt 1.00 7.0 7 <NoDefault[3]>
## 4: minprop ParamDbl 0.01 0.1 Inf <NoDefault[3]>
## 5: splitrule ParamFct NA NA 2 <NoDefault[3]>
Popular search strategies are random_search and grid_search. Termination is set to a specific number of evaluations. Internal evaluation is based on 5-fold CV.
tuned_ranger = AutoTuner$new(
learner = covid_ranger,
resampling = rsmp("cv", folds = 5),
measure = msr("classif.auc"),
search_space = search_space,
terminator = trm("evals", n_evals = 10),
tuner = tnr("random_search")
)
tuned_ranger
## <AutoTuner:classif.ranger.tuned>
## * Model: list
## * Search Space:
## <ParamSet>
## id class lower upper nlevels default value
## 1: num.trees ParamInt 50.00 500.0 451 <NoDefault[3]>
## 2: max.depth ParamInt 1.00 10.0 10 <NoDefault[3]>
## 3: mtry ParamInt 1.00 7.0 7 <NoDefault[3]>
## 4: minprop ParamDbl 0.01 0.1 Inf <NoDefault[3]>
## 5: splitrule ParamFct NA NA 2 <NoDefault[3]>
## * Packages: mlr3, mlr3tuning, mlr3learners, ranger
## * Predict Type: prob
## * Feature Types: logical, integer, numeric, character, factor, ordered
## * Properties: hotstart_backward, importance, multiclass, oob_error,
## twoclass, weights
tuned_ranger$train(covid_task)
tuned_ranger$tuning_result
## num.trees max.depth mtry minprop splitrule learner_param_vals x_domain
## 1: 330 5 3 0.06028733 gini <list[6]> <list[5]>
## classif.auc
## 1: 0.9261972
tuned_ranger$predict_newdata(newdata = covid_spring)$prob[1:4,]
## Yes No
## [1,] 0.006292130 0.9937079
## [2,] 0.006394404 0.9936056
## [3,] 0.006316266 0.9936837
## [4,] 0.006316266 0.9936837
model_tuned <- explain(tuned_ranger,
predict_function = function(m,x)
m$predict_newdata(newdata = x)$prob[,1],
data = covid_summer[,-8],
y = covid_summer$Death == "Yes",
type = "classification",
label = "AutoTune",
verbose = FALSE)
mp_tuned <- model_performance(model_tuned)
mp_tuned
## Measures for: classification
## recall : 0.02575107
## precision : 0.4
## f1 : 0.0483871
## accuracy : 0.9764
## auc : 0.9462537
##
## Residuals:
## 0% 10% 20% 30% 40% 50%
## -0.578273426 -0.133212721 -0.018550461 -0.007368916 -0.006316266 -0.006292130
## 60% 70% 80% 90% 100%
## -0.006247394 -0.005615708 -0.005595415 -0.005526543 0.994404585
plot(mp_tuned, geom = "roc")
plot(mp_ranger, mp_tree, mp_cdc, mp_tuned, geom = "roc")
do.call(rbind,
list(cdc = mp_cdc$measures,
tree = mp_tree$measures,
ranger = mp_ranger$measures,
tuned = mp_tuned$measures))
## recall precision f1 accuracy auc
## cdc 0.2188841 0.2602041 0.2377622 0.9673 0.9042275
## tree 0.8626609 0.1492205 0.2544304 0.8822 0.9126477
## ranger 0.111588 0.4482759 0.1786942 0.9761 0.942663
## tuned 0.02575107 0.4 0.0483871 0.9764 0.9462537
Exercises: Check the performance of the tuned model on the covid_spring and covid_summer data. In the DALEX package you will find the titanic_imputed dataset. Optimize a tree-based model for the Titanic dataset. How good is your model?
We will devote the second day entirely to methods for model exploration.
Some models have built-in methods for assessing variable importance. For linear models, one can use standardized model coefficients or p-values. For random forests, one can use the out-of-bag classification error. For tree boosting models, one can use gain statistics. The problem with such measures is that not all models have built-in variable importance statistics (e.g., neural networks), and that scores from different models cannot be directly compared (how do you compare gains with p-values?).
This is why we need a model-agnostic approach whose results are comparable between different models. The procedure described below is universal and does not depend on the model structure.
It is based on perturbations of variables in the validation data. If a variable is important in a model, then after its permutation the model predictions should be less accurate.
The permutation-based importance of variable \(i\) is the difference between the model performance measured on data with variable \(i\) permuted and the performance on the original data
\[ VI(i) = L(f, X^{perm(i)}, y) - L(f, X, y) \]
where \(L(f, X, y)\) is the value of the loss function for the original data \(X\), true labels \(y\), and model \(f\), while \(X^{perm(i)}\) is the dataset \(X\) with the \(i\)-th variable permuted.
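The formula above can be sketched in a few lines of base R on a toy linear model (all names — f, mse, vi — are ours; DALEX's model_parts, used below, implements the real thing for wrapped models):

```r
# Permutation importance by hand, on a toy linear model with MSE loss.
set.seed(1313)
n <- 500
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- 2 * X$x1 + 0.1 * X$x2 + rnorm(n, sd = 0.1)
f   <- function(X) 2 * X$x1 + 0.1 * X$x2   # the "trained" model
mse <- function(X, y) mean((f(X) - y)^2)   # loss L(f, X, y)

vi <- sapply(names(X), function(v) {
  Xp <- X
  Xp[[v]] <- sample(Xp[[v]])   # permute variable v, keep the rest
  mse(Xp, y) - mse(X, y)       # L(f, X^perm(v), y) - L(f, X, y)
})
vi                              # x1 matters far more than x2
```

Because only predictions are needed, the same loop works for any model, which is exactly the model-agnostic property discussed above.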
Which performance measure should you choose? It's up to you. In the DALEX library, by default, RMSE is used for regression and 1-AUC for classification problems. But you can change the loss function by specifying the loss_function argument.
mpart_ranger <- model_parts(model_ranger)
mpart_ranger
## variable mean_dropout_loss label
## 1 _full_model_ 0.05594636 Ranger
## 2 Diabetes 0.05694792 Ranger
## 3 Neurological.Diseases 0.05704582 Ranger
## 4 Kidney.Diseases 0.05831538 Ranger
## 5 Gender 0.06466944 Ranger
## 6 Cancer 0.06792829 Ranger
## 7 Cardiovascular.Diseases 0.06893274 Ranger
## 8 Age 0.19411873 Ranger
## 9 _baseline_ 0.50588311 Ranger
plot(mpart_ranger, show_boxplots = FALSE, bar_width=4) +
DALEX:::theme_ema_vertical() +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0)) +
ggtitle("Variable importance","")
mpart_ranger <- model_parts(model_ranger, type = "difference")
mpart_ranger
## variable mean_dropout_loss label
## 1 _full_model_ 0.0000000000 Ranger
## 2 Neurological.Diseases 0.0009080685 Ranger
## 3 Kidney.Diseases 0.0020899909 Ranger
## 4 Cancer 0.0040778353 Ranger
## 5 Diabetes 0.0065831715 Ranger
## 6 Gender 0.0090921666 Ranger
## 7 Cardiovascular.Diseases 0.0191041113 Ranger
## 8 Age 0.1465520752 Ranger
## 9 _baseline_ 0.4158748814 Ranger
plot(mpart_ranger, show_boxplots = FALSE, bar_width=4) +
DALEX:::theme_ema_vertical() +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0)) +
ggtitle("Variable importance","")
mpart_cdc <- model_parts(model_cdc)
mpart_tree <- model_parts(model_tree)
mpart_ranger <- model_parts(model_ranger)
mpart_tuned <- model_parts(model_tuned)
plot(mpart_cdc, mpart_tree, mpart_ranger, mpart_tuned, show_boxplots = FALSE, bar_width=4) +
DALEX:::theme_ema_vertical() +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0)) +
ggtitle("Variable importance","") +
facet_wrap(~label, ncol = 2, scales = "free_y")
plot(mpart_cdc, mpart_tree, mpart_ranger, mpart_tuned, show_boxplots = FALSE, bar_width=4) +
DALEX:::theme_ema_vertical() +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0)) +
ggtitle("Variable importance","")
Your turn: compare the results for the covid_summer data with results on the covid_spring data. In the DALEX package you will find the titanic_imputed dataset. Train a ranger model and calculate variable importance.
Partial dependence profiles are averages of CP profiles over all (or a sufficiently large number of) observations.
The model_profile() function calculates PD profiles for
a specified model and variables (all by default).
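The averaging itself is simple enough to write out. Below is an illustrative base-R sketch on a toy dataset (names like `partial_dependence` are made up for this example and are not the DALEX implementation): for each grid value, set the variable to that value for every observation and average the predictions.

```r
# Toy data: risk grows with age; a binary risk factor shifts it up.
set.seed(2)
df <- data.frame(age = runif(200, 20, 80), risk_factor = rbinom(200, 1, 0.4))
df$y <- plogis(-6 + 0.08 * df$age + 1.2 * df$risk_factor)
f <- lm(y ~ age + risk_factor, data = df)

# PD profile = average of CP predictions over the whole dataset
partial_dependence <- function(model, data, variable, grid) {
  sapply(grid, function(v) {
    data[[variable]] <- v          # fix the variable at v for everyone
    mean(predict(model, data))     # average the ceteris-paribus predictions
  })
}

grid <- seq(20, 80, by = 10)
pd <- partial_dependence(f, df, "age", grid)
```

Because the toy model increases with age, the resulting profile is monotonically increasing over the grid.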
mprof_cdc <- model_profile(model_cdc, "Age")
plot(mprof_cdc)
mgroup_ranger <- model_profile(model_ranger, variable_splits = list(Age = 0:100))
plot(mgroup_ranger)+
DALEX:::theme_ema() +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0)) +
ggtitle("PD profile","")
Grouped partial dependence profiles
By default, the average is calculated over all observations. But with
the groups= argument one can specify a factor variable, and CP profiles
will then be averaged within each of its levels.
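A grouped PD profile is just the same average taken within each level of the grouping factor. The sketch below (base R, synthetic data; `grouped_pd` is an illustrative name, not a DALEX function) returns one profile per group:

```r
# Toy data: risk grows with age, with an upward shift for diabetics.
set.seed(8)
d <- data.frame(age = runif(400, 20, 90),
                diabetes = factor(sample(c("No", "Yes"), 400, replace = TRUE)))
d$y <- 0.01 * d$age + 0.3 * (d$diabetes == "Yes") + rnorm(400, sd = 0.05)
f <- lm(y ~ age + diabetes, data = d)

# One PD profile per level of the grouping factor
grouped_pd <- function(model, data, variable, grid, group) {
  sapply(split(data, data[[group]]), function(part) {
    sapply(grid, function(v) {
      part[[variable]] <- v
      mean(predict(model, part))
    })
  })
}

pd <- grouped_pd(f, d, "age", seq(20, 90, by = 10), "diabetes")  # matrix, one column per group
```

In this additive toy model the two profiles are parallel, with the "Yes" curve shifted up by roughly the diabetes coefficient.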
mgroup_ranger <- model_profile(model_ranger, variable_splits = list(Age = 0:100), groups = "Diabetes")
plot(mgroup_ranger)+
DALEX:::theme_ema() +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0)) +
ggtitle("PD profiles for groups","") + ylab("") + theme(legend.position = "top")
mgroup_ranger <- model_profile(model_ranger, variable_splits = list(Age = 0:100), groups = "Cardiovascular.Diseases")
plot(mgroup_ranger)+
DALEX:::theme_ema() +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0)) +
ggtitle("PDP variable profile","") + ylab("") + theme(legend.position = "top")
mgroup_ranger <- model_profile(model_ranger, "Age", k = 3, center = TRUE)
plot(mgroup_ranger)+
DALEX:::theme_ema() +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0)) +
ggtitle("PD profiles for segments","")
mgroup_ranger <- model_profile(model_ranger, "Age", groups = "Cardiovascular.Diseases")
plot(mgroup_ranger)+
DALEX:::theme_ema() +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0)) +
ggtitle("Variable profile","")
mgroup_ranger <- model_profile(model_ranger, variable_splits = list(Age = 0:100), groups = "Cardiovascular.Diseases")
plot(mgroup_ranger, geom = "profiles")
plot(mgroup_ranger, geom = "points")
mprof_cdc <- model_profile(model_cdc, variable_splits = list(Age=0:100))
mprof_tree <- model_profile(model_tree, variable_splits = list(Age=0:100))
mprof_ranger <- model_profile(model_ranger, variable_splits = list(Age=0:100))
mprof_tuned <- model_profile(model_tuned, variable_splits = list(Age=0:100))
Profiles can then be drawn with the plot() function.
plot(mprof_tuned, mprof_cdc, mprof_tree, mprof_ranger) +
ggtitle("","") +
DALEX:::theme_ema() +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0), legend.position = "top")
If the model is additive, all CP profiles are parallel. But if the model has interactions, CP profiles may have different shapes for different observations. Setting the k argument finds k segments (clusters) of CP profiles and calculates the average within each segment.
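The parallel-profiles claim can be checked directly. In this base-R sketch on toy data (illustrative names, not DALEX code), the gap between the CP profiles of two observations is constant for an additive model and varies once an interaction term is fitted:

```r
# Toy data with a genuine interaction between a and g.
set.seed(3)
d <- data.frame(a = runif(300), g = rbinom(300, 1, 0.5))
d$y <- d$a + d$g + 2 * d$a * d$g + rnorm(300, sd = 0.05)

f_add <- lm(y ~ a + g, data = d)   # additive model
f_int <- lm(y ~ a * g, data = d)   # model with an interaction

# CP profile for variable "a" of a single observation
cp <- function(model, obs, grid) {
  sapply(grid, function(v) {
    obs$a <- v
    predict(model, obs)
  })
}

grid <- seq(0, 1, by = 0.25)
obs0 <- data.frame(a = 0.5, g = 0)
obs1 <- data.frame(a = 0.5, g = 1)

gap_add <- cp(f_add, obs1, grid) - cp(f_add, obs0, grid)  # constant: parallel profiles
gap_int <- cp(f_int, obs1, grid) - cp(f_int, obs0, grid)  # grows with a: profiles diverge
```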
PD profiles do not take the correlation structure
between the variables into account. For correlated variables, the ceteris-paribus
assumption may not make sense. The model_profile function
can also calculate other types of aggregates, such as marginal profiles
and accumulated local profiles. To do this, set the argument
type = "conditional" or
type = "accumulated".
Your turn: compare the results for the covid_summer data with results on the covid_spring data. In the DALEX package you will find the titanic_imputed dataset. Train a ranger model and plot variable profiles.
Once we calculate the model prediction, the question often arises which variables had the greatest impact on it.
For linear models it is easy to assess the impact of individual variables because there is one coefficient for each variable.
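For a linear model, the attribution of variable \(j\) to a prediction can be read off as \(\beta_j x_j\). A small base-R illustration on synthetic data (the variables and coefficients are made up for this example):

```r
# Toy linear model on two variables.
set.seed(4)
d <- data.frame(age = rnorm(100, 50, 10), bmi = rnorm(100, 25, 3))
d$y <- 0.03 * d$age + 0.05 * d$bmi + rnorm(100, sd = 0.1)
f <- lm(y ~ age + bmi, data = d)

new_obs <- data.frame(age = 76, bmi = 24)
beta <- coef(f)

# Per-variable contribution to this prediction: beta_j * x_j
contrib <- beta[c("age", "bmi")] * unlist(new_obs)
prediction <- beta["(Intercept)"] + sum(contrib)
```

The contributions plus the intercept reproduce the model prediction exactly, which is what makes linear models so easy to interpret.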
Steve <- data.frame(Gender = factor("Male", c("Female", "Male")),
Age = 76,
Cardiovascular.Diseases = factor("Yes", c("No", "Yes")),
Diabetes = factor("No", c("No", "Yes")),
Neurological.Diseases = factor("No", c("No", "Yes")),
Kidney.Diseases = factor("No", c("No", "Yes")),
Cancer = factor("No", c("No", "Yes")))
predict(model_ranger, Steve)
## Yes
## 0.3219832
Steve
## Gender Age Cardiovascular.Diseases Diabetes Neurological.Diseases
## 1 Male 76 Yes No No
## Kidney.Diseases Cancer
## 1 No No
It turns out that such attributions can be calculated for any
predictive model. The most popular model-agnostic method is Shapley
values. They may be calculated with the predict_parts()
function.
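To build intuition for what predict_parts() estimates, here is a hedged base-R sketch of exact Shapley values for a toy model with only two variables: with two "players" there are only two orderings to average over. The value function `v` (an illustrative name) fixes the variables in a coalition at the observation's values and averages predictions over the data:

```r
# Toy model: x2 matters twice as much as x1.
set.seed(5)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- d$x1 + 2 * d$x2 + rnorm(200, sd = 0.1)
f <- lm(y ~ x1 + x2, data = d)
obs <- data.frame(x1 = 1, x2 = 1)

# v(S): expected prediction with variables in S fixed at obs, rest averaged out
v <- function(S) {
  data <- d[c("x1", "x2")]
  for (s in S) data[[s]] <- obs[[s]]
  mean(predict(f, data))
}

# Shapley value = marginal contribution averaged over both orderings
phi_x1 <- mean(c(v("x1") - v(character(0)),
                 v(c("x1", "x2")) - v("x2")))
phi_x2 <- mean(c(v("x2") - v(character(0)),
                 v(c("x1", "x2")) - v("x1")))
```

The two attributions sum to the difference between this observation's prediction and the average prediction (the "efficiency" property of Shapley values).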
ppart_cdc <- predict_parts(model_cdc, Steve, type = "shap")
plot(ppart_cdc)
ppart_tree <- predict_parts(model_tree, Steve, type = "shap")
plot(ppart_tree)
ppart_tree <- predict_parts(model_tree, Steve)
plot(ppart_tree)
ppart_ranger <- predict_parts(model_ranger, Steve, type = "shap")
pl1 <- plot(ppart_ranger) + ggtitle("Shapley values for Ranger")+
DALEX:::theme_ema_vertical() +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0))
ppart_ranger <- predict_parts(model_ranger, Steve)
pl2 <- plot(ppart_ranger) + ggtitle("Break-down for Ranger")+
DALEX:::theme_ema_vertical() +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0))
library("patchwork")
pl1 + (pl2 + scale_y_continuous("prediction", limits = c(0,0.4)))
ppart_ranger <- predict_parts(model_ranger, Steve,
keep_distributions = TRUE)
plot(ppart_ranger, plot_distributions = TRUE) + ggtitle("Consecutive conditioning for Ranger")+
DALEX:::theme_ema_vertical() +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0))
ppart_tuned <- predict_parts(model_tuned, Steve, type = "shap")
plot(ppart_tuned)
The show_boxplots argument allows you to highlight the
stability bars of the estimated attributions.
Other possible values of the type argument are
"oscillations", "shap", "break_down", and
"break_down_interactions".
With the order argument one can force a certain sequence of
variables.
By default, functions such as model_parts,
predict_parts, and model_profile do not calculate
statistics on the entire data set, but on n_samples
random cases, and the entire procedure is repeated B times
to estimate the error bars.
ppart_cdc <- predict_parts(model_cdc, Steve)
plot(ppart_cdc)
ppart_tuned <- predict_parts(model_tuned, Steve)
plot(ppart_tuned)
Your turn: compare the results for the covid_summer data with results on the covid_spring data. In the DALEX package you will find the titanic_imputed dataset. Train a ranger model and calculate a local variable attribution.
Profile for a single prediction
Ceteris paribus is a Latin phrase for “other things being equal”.
Ceteris-paribus profiles show how the model response would change for a selected observation if one of the coordinates of that observation were changed while the other coordinates were left unchanged.
The predict_profile() function calculates CP profiles
for a selected observation, model, and vector of variables (all
continuous variables by default).
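Mechanically, a CP profile is just the model response recorded while one variable of a single observation sweeps over a grid. A base-R sketch on a toy model (illustrative names and data, not the DALEX implementation):

```r
# Toy risk model: probability grows with age, shifted up for males.
set.seed(6)
d <- data.frame(age = runif(300, 20, 90), male = rbinom(300, 1, 0.5))
d$y <- plogis(-7 + 0.09 * d$age + 0.4 * d$male)
f <- glm(y ~ age + male, data = d, family = quasibinomial())

obs <- data.frame(age = 76, male = 1)

# CP profile: vary one coordinate, keep the others fixed
cp_profile <- function(model, obs, variable, grid) {
  sapply(grid, function(v) {
    obs[[variable]] <- v
    predict(model, obs, type = "response")
  })
}

cp_age <- cp_profile(f, obs, "age", 0:100)
```

Since the toy model's age effect is positive, the profile rises monotonically and stays within the (0, 1) probability range.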
mprof_cdc <- predict_profile(model_cdc, Steve, "Age")
plot(mprof_cdc)
CP profiles can be visualized with the generic plot() function.
For technical reasons, quantitative and qualitative variables cannot be shown in a single chart, so if you want to show profiles for qualitative variables you need to plot them separately.
mprof_cdc <- predict_profile(model_cdc, variable_splits = list(Age=0:100), Steve)
mprof_tree <- predict_profile(model_tree, variable_splits = list(Age=0:100), Steve)
mprof_ranger <- predict_profile(model_ranger, variable_splits = list(Age=0:100), Steve)
mprof_tuned <- predict_profile(model_tuned, variable_splits = list(Age=0:100), Steve)
plot(mprof_tuned)
plot(mprof_tuned, mprof_cdc, mprof_tree, mprof_ranger)
mprof_tuned <- predict_profile(model_tuned, variables = "Age", Steve)
pl1 <- plot(mprof_tuned) + ggtitle("Ceteris paribus for Ranger")+
DALEX:::theme_ema() + scale_y_continuous("prediction", limits = c(0,0.55)) +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0))
mprof_tuned2 <- predict_profile(model_tuned, variables = "Cardiovascular.Diseases", Steve)
pl2 <- plot(mprof_tuned2, variable_type = "categorical", variables = "Cardiovascular.Diseases", categorical_type = "lines") + ggtitle("Ceteris paribus for Ranger")+
DALEX:::theme_ema() + scale_y_continuous("prediction", limits = c(0,0.55)) +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0))
pl1 + pl2
plot(mprof_tuned, mprof_cdc, mprof_tree, mprof_ranger) + ggtitle("Ceteris paribus for Steve")+
DALEX:::theme_ema() +
theme( axis.text = element_text(color = "black", size = 12, hjust = 0))
The local importance of variables can be measured as oscillations of CP
profiles. The greater the variability of the CP profile, the more important
the variable is. Set type = "oscillations" in the
predict_parts function.
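One simple way to score such oscillations is the mean absolute deviation of the CP profile from the observation's own prediction. The sketch below (base R, toy data; a simplified score, not the exact DALEX estimator) shows the influential variable getting a much larger oscillation:

```r
# Toy model: age matters, "minor" barely does.
set.seed(7)
d <- data.frame(age = runif(300, 20, 90), minor = rnorm(300))
d$y <- 0.02 * d$age + 0.001 * d$minor + rnorm(300, sd = 0.05)
f <- lm(y ~ age + minor, data = d)
obs <- data.frame(age = 76, minor = 0)

# Oscillation: how far the CP profile strays from the current prediction
oscillation <- function(model, obs, variable, grid) {
  base <- predict(model, obs)
  cp <- sapply(grid, function(v) {
    obs[[variable]] <- v
    predict(model, obs)
  })
  mean(abs(cp - base))
}

osc_age   <- oscillation(f, obs, "age",   seq(20, 90, by = 5))
osc_minor <- oscillation(f, obs, "minor", seq(-3, 3, by = 0.5))
```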
Your turn: compare the results for the covid_summer data with results on the covid_spring data. In the DALEX package you will find the titanic_imputed dataset. Train a ranger model and calculate local profiles.
Play with your model!
library("modelStudio")
ms <- modelStudio(model_ranger, new_observation = Steve)
ms
library("arenar")
library("dplyr")
rownames(Steve) = c("Steve")
covid_ar <- create_arena(live = TRUE) %>%
push_model(model_cdc) %>%
push_model(model_tree) %>%
push_model(model_ranger) %>%
push_model(model_tuned) %>%
push_observations(Steve)
run_server(covid_ar)
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.3.1 (2023-06-16)
## os macOS Sonoma 14.0
## system aarch64, darwin20
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Europe/Amsterdam
## date 2023-10-11
## pandoc 3.1.2 @ /opt/homebrew/bin/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## backports 1.4.1 2021-12-13 [2] CRAN (R 4.3.0)
## bbotk 0.7.2 2022-12-08 [1] CRAN (R 4.3.0)
## bslib 0.5.1 2023-08-11 [1] CRAN (R 4.3.0)
## cachem 1.0.8 2023-05-01 [2] CRAN (R 4.3.0)
## callr 3.7.3 2022-11-02 [2] CRAN (R 4.3.0)
## checkmate 2.2.0 2023-04-27 [2] CRAN (R 4.3.0)
## class 7.3-22 2023-05-03 [2] CRAN (R 4.3.1)
## cli 3.6.1 2023-03-23 [2] CRAN (R 4.3.0)
## codetools 0.2-19 2023-02-01 [2] CRAN (R 4.3.1)
## colorspace 2.1-0 2023-01-23 [2] CRAN (R 4.3.0)
## crayon 1.5.2 2022-09-29 [2] CRAN (R 4.3.0)
## DALEX * 2.4.3 2023-01-15 [2] CRAN (R 4.3.0)
## data.table 1.14.8 2023-02-17 [2] CRAN (R 4.3.0)
## DBI 1.1.3 2022-06-18 [1] CRAN (R 4.3.0)
## devtools 2.4.5 2022-10-11 [1] CRAN (R 4.3.0)
## digest 0.6.33 2023-07-07 [1] CRAN (R 4.3.0)
## dplyr 1.1.3 2023-09-03 [2] CRAN (R 4.3.0)
## e1071 1.7-13 2023-02-01 [1] CRAN (R 4.3.0)
## ellipsis 0.3.2 2021-04-29 [2] CRAN (R 4.3.0)
## evaluate 0.21 2023-05-05 [2] CRAN (R 4.3.0)
## fansi 1.0.4 2023-01-22 [2] CRAN (R 4.3.0)
## farver 2.1.1 2022-07-06 [2] CRAN (R 4.3.0)
## fastmap 1.1.1 2023-02-24 [2] CRAN (R 4.3.0)
## forcats 1.0.0 2023-01-29 [2] CRAN (R 4.3.0)
## Formula 1.2-5 2023-02-24 [1] CRAN (R 4.3.0)
## fs 1.6.3 2023-07-20 [2] CRAN (R 4.3.0)
## future 1.33.0 2023-07-01 [1] CRAN (R 4.3.0)
## future.apply 1.11.0 2023-05-21 [1] CRAN (R 4.3.0)
## generics 0.1.3 2022-07-05 [2] CRAN (R 4.3.0)
## ggmosaic * 0.3.3 2021-02-23 [1] CRAN (R 4.3.0)
## ggplot2 * 3.4.3 2023-08-14 [1] CRAN (R 4.3.0)
## ggrepel 0.9.3 2023-02-03 [1] CRAN (R 4.3.0)
## globals 0.16.2 2022-11-21 [1] CRAN (R 4.3.0)
## glue 1.6.2 2022-02-24 [2] CRAN (R 4.3.0)
## gtable 0.3.4 2023-08-21 [2] CRAN (R 4.3.0)
## haven 2.5.3 2023-06-30 [2] CRAN (R 4.3.0)
## highr 0.10 2022-12-22 [2] CRAN (R 4.3.0)
## hms 1.1.3 2023-03-21 [2] CRAN (R 4.3.0)
## htmltools 0.5.6 2023-08-10 [2] CRAN (R 4.3.0)
## htmlwidgets 1.6.2 2023-03-17 [2] CRAN (R 4.3.0)
## httpuv 1.6.11 2023-05-11 [2] CRAN (R 4.3.0)
## httr 1.4.7 2023-08-15 [2] CRAN (R 4.3.0)
## iBreakDown 2.0.1 2021-05-07 [2] CRAN (R 4.3.0)
## ingredients 2.3.0 2023-01-15 [2] CRAN (R 4.3.0)
## inum 1.0-5 2023-03-09 [1] CRAN (R 4.3.0)
## jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.3.0)
## jsonlite 1.8.7 2023-06-29 [1] CRAN (R 4.3.0)
## knitr 1.43 2023-05-25 [2] CRAN (R 4.3.0)
## labeling 0.4.3 2023-08-29 [2] CRAN (R 4.3.0)
## labelled 2.12.0 2023-06-21 [2] CRAN (R 4.3.0)
## later 1.3.1 2023-05-02 [2] CRAN (R 4.3.0)
## lattice 0.21-8 2023-04-05 [2] CRAN (R 4.3.1)
## lazyeval 0.2.2 2019-03-15 [1] CRAN (R 4.3.0)
## lgr 0.4.4 2022-09-05 [1] CRAN (R 4.3.0)
## libcoin * 1.0-9 2021-09-27 [1] CRAN (R 4.3.0)
## lifecycle 1.0.3 2022-10-07 [2] CRAN (R 4.3.0)
## listenv 0.9.0 2022-12-16 [1] CRAN (R 4.3.0)
## magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.3.0)
## MASS 7.3-60 2023-05-04 [2] CRAN (R 4.3.1)
## Matrix 1.6-1 2023-08-14 [1] CRAN (R 4.3.0)
## memoise 2.0.1 2021-11-26 [2] CRAN (R 4.3.0)
## mime 0.12 2021-09-28 [2] CRAN (R 4.3.0)
## miniUI 0.1.1.1 2018-05-18 [1] CRAN (R 4.3.0)
## mitools 2.4 2019-04-26 [1] CRAN (R 4.3.0)
## mlr3 * 0.16.1 2023-06-17 [1] CRAN (R 4.3.0)
## mlr3learners * 0.5.6 2023-01-06 [1] CRAN (R 4.3.0)
## mlr3measures 0.5.0 2022-08-05 [1] CRAN (R 4.3.0)
## mlr3misc 0.13.0 2023-09-20 [1] CRAN (R 4.3.1)
## mlr3tuning * 0.19.0 2023-06-26 [1] CRAN (R 4.3.0)
## munsell 0.5.0 2018-06-12 [2] CRAN (R 4.3.0)
## mvtnorm * 1.2-3 2023-08-25 [1] CRAN (R 4.3.0)
## palmerpenguins 0.1.1 2022-08-15 [1] CRAN (R 4.3.0)
## paradox * 0.11.1 2023-03-17 [1] CRAN (R 4.3.0)
## parallelly 1.36.0 2023-05-26 [1] CRAN (R 4.3.0)
## partykit * 1.2-20 2023-04-14 [1] CRAN (R 4.3.0)
## patchwork * 1.1.3 2023-08-14 [2] CRAN (R 4.3.0)
## pheatmap * 1.0.12 2019-01-04 [1] CRAN (R 4.3.0)
## pillar 1.9.0 2023-03-22 [2] CRAN (R 4.3.0)
## pkgbuild 1.4.2 2023-06-26 [1] CRAN (R 4.3.0)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.3.0)
## pkgload 1.3.2.1 2023-07-08 [1] CRAN (R 4.3.0)
## plotly 4.10.2 2023-06-03 [1] CRAN (R 4.3.0)
## plyr 1.8.8 2022-11-11 [1] CRAN (R 4.3.0)
## prettyunits 1.1.1 2020-01-24 [2] CRAN (R 4.3.0)
## processx 3.8.2 2023-06-30 [1] CRAN (R 4.3.0)
## productplots 0.1.1 2016-07-02 [1] CRAN (R 4.3.0)
## profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.0)
## promises 1.2.1 2023-08-10 [2] CRAN (R 4.3.0)
## proxy 0.4-27 2022-06-09 [1] CRAN (R 4.3.0)
## ps 1.7.5 2023-04-18 [2] CRAN (R 4.3.0)
## purrr 1.0.2 2023-08-10 [2] CRAN (R 4.3.0)
## R6 2.5.1 2021-08-19 [2] CRAN (R 4.3.0)
## ranger * 0.15.1 2023-04-03 [1] CRAN (R 4.3.0)
## RColorBrewer 1.1-3 2022-04-03 [2] CRAN (R 4.3.0)
## Rcpp 1.0.11 2023-07-06 [2] CRAN (R 4.3.0)
## remotes 2.4.2.1 2023-07-18 [2] CRAN (R 4.3.0)
## rlang 1.1.1 2023-04-28 [2] CRAN (R 4.3.0)
## rmarkdown 2.24 2023-08-14 [1] CRAN (R 4.3.0)
## rpart 4.1.19 2022-10-21 [2] CRAN (R 4.3.1)
## rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.0)
## sass 0.4.7 2023-07-15 [2] CRAN (R 4.3.0)
## scales 1.2.1 2022-08-20 [2] CRAN (R 4.3.0)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.0)
## shiny 1.7.5 2023-08-12 [2] CRAN (R 4.3.0)
## stringi 1.7.12 2023-01-11 [2] CRAN (R 4.3.0)
## stringr 1.5.0 2022-12-02 [2] CRAN (R 4.3.0)
## survey 4.2-1 2023-05-03 [1] CRAN (R 4.3.0)
## survival 3.5-7 2023-08-14 [2] CRAN (R 4.3.0)
## tableone * 0.13.2 2022-04-15 [1] CRAN (R 4.3.0)
## tibble 3.2.1 2023-03-20 [2] CRAN (R 4.3.0)
## tidyr 1.3.0 2023-01-24 [2] CRAN (R 4.3.0)
## tidyselect 1.2.0 2022-10-10 [2] CRAN (R 4.3.0)
## urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.0)
## usethis 2.2.2 2023-07-06 [1] CRAN (R 4.3.0)
## utf8 1.2.3 2023-01-31 [2] CRAN (R 4.3.0)
## uuid 1.1-1 2023-08-17 [2] CRAN (R 4.3.0)
## vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.3.0)
## viridisLite 0.4.2 2023-05-02 [2] CRAN (R 4.3.0)
## withr 2.5.0 2022-03-03 [2] CRAN (R 4.3.0)
## xfun 0.40 2023-08-09 [2] CRAN (R 4.3.0)
## xtable 1.8-4 2019-04-21 [2] CRAN (R 4.3.0)
## yaml 2.3.7 2023-01-23 [2] CRAN (R 4.3.0)
## zoo 1.8-12 2023-04-13 [1] CRAN (R 4.3.0)
##
## [1] /Users/theobakker/Library/R/arm64/4.3/library
## [2] /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library
##
## ──────────────────────────────────────────────────────────────────────────────